Learning to Answer Yes/No

Roadmap
(1) When Can Machines Learn?

    Lecture 1: The Learning Problem
    A takes D and H to get g

    Lecture 2: Learning to Answer Yes/No
    • Perceptron Hypothesis Set
    • Perceptron Learning Algorithm (PLA)
    • Guarantee of PLA
    • Non-Separable Data

(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?

Learning to Answer Yes/No Perceptron Hypothesis Set

Credit Approval Problem Revisited

Applicant Information
age                  23 years
gender               female
annual salary        NTD 1,000,000
year in residence    1 year
year in job          0.5 year
current debt         200,000

[Figure: learning flow. The unknown target function f: X → Y (ideal credit approval formula) generates the training examples D: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ (historical records in bank). The learning algorithm A takes D and the hypothesis set H (set of candidate formulas) and outputs the final hypothesis g ≈ f ('learned' formula to be used).]

what hypothesis set can we use?



A Simple Hypothesis Set: the ‘Perceptron’

• For $\mathbf{x} = (x_1, x_2, \ldots, x_d)$ 'features of customer', compute a weighted 'score' and

  approve credit if $\sum_{i=1}^{d} w_i x_i > \text{threshold}$
  deny credit if $\sum_{i=1}^{d} w_i x_i < \text{threshold}$

• Y: {+1 (good), −1 (bad)}, 0 ignored; linear formulas h ∈ H are

  $h(\mathbf{x}) = \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right)$

called 'perceptron' hypothesis historically
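
A minimal Python sketch of the threshold rule above. The feature values are taken from the applicant table; the weights and the threshold are made-up numbers for illustration only, not values from the lecture.

```python
def perceptron_threshold(x, w, threshold):
    """Return +1 (approve) if the weighted score exceeds the threshold, else -1 (deny).

    The slides ignore the tie case (score == threshold); this sketch
    sends ties to -1 as an arbitrary choice.
    """
    score = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return 1 if score > threshold else -1


# Features: age (years), annual salary (NTD), years in job, current debt
applicant = [23.0, 1_000_000.0, 0.5, 200_000.0]

# Hypothetical weights: salary and job tenure raise the score, debt lowers it.
weights = [0.0, 1.0, 100_000.0, -2.0]
threshold = 500_000.0  # hypothetical

decision = perceptron_threshold(applicant, weights, threshold)
print(decision)
```

With these assumed numbers the score is 1,000,000 + 50,000 − 400,000 = 650,000, which exceeds the threshold, so the sketch approves the applicant.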
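Vector Form of Perceptron Hypothesis

$$
\begin{aligned}
h(\mathbf{x}) &= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) - \text{threshold}\right) \\
&= \operatorname{sign}\left(\left(\sum_{i=1}^{d} w_i x_i\right) + \underbrace{(-\text{threshold})}_{w_0} \cdot \underbrace{(+1)}_{x_0}\right) \\
&= \operatorname{sign}\left(\sum_{i=0}^{d} w_i x_i\right) \\
&= \operatorname{sign}\left(\mathbf{w}^T \mathbf{x}\right)
\end{aligned}
$$

• each 'tall' w represents a hypothesis h and is multiplied with 'tall' x; will use tall versions to simplify notation

what do perceptrons h 'look like'?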
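
The equivalence of the threshold form and the 'tall' vector form can be checked numerically. The sketch below uses made-up numbers and helper names (`h_threshold`, `h_vector`) that are not from the lecture; it only illustrates the substitution w_0 = −threshold, x_0 = +1.

```python
def sign(s):
    # sign(0) is ignored in the slides; this sketch maps it to -1 arbitrarily.
    return 1 if s > 0 else -1


def h_threshold(x, w, threshold):
    """Threshold form: sign((sum_i w_i x_i) - threshold)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) - threshold)


def h_vector(x_tall, w_tall):
    """Vector form: sign(w^T x) with the 'tall' (d+1)-dimensional vectors."""
    return sign(sum(wi * xi for wi, xi in zip(w_tall, x_tall)))


x = [2.0, -1.0, 0.5]       # hypothetical features
w = [0.4, 0.3, -1.2]       # hypothetical weights
threshold = 0.25           # hypothetical threshold

x_tall = [1.0] + x         # prepend x_0 = +1
w_tall = [-threshold] + w  # prepend w_0 = -threshold

same = h_threshold(x, w, threshold) == h_vector(x_tall, w_tall)
print(same)
```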
Perceptrons in $\mathbb{R}^2$

$h(\mathbf{x}) = \operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$

• customer features x: points on the plane (or points in $\mathbb{R}^d$)
• labels y: ○ (+1), × (−1)
• hypothesis h: lines (or hyperplanes in $\mathbb{R}^d$), positive on one side of a line, negative on the other side
• different lines classify customers differently

perceptrons ⇔ linear (binary) classifiers
Fun Time

Consider using a perceptron to detect spam messages.

Assume that each email is represented by the frequency of keyword occurrence, and output +1 indicates spam. Which keywords below shall have large positive weights in a good perceptron for the task?

(1) coffee, tea, hamburger, steak
(2) free, drug, fantastic, deal
(3) machine, learning, statistics, textbook
(4) national, Taiwan, university, coursera

Reference Answer: (2)

The occurrence of keywords with positive weights increases the 'spam score', and hence those keywords should often appear in spam.
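
To make the 'spam score' concrete, here is a small sketch that scores a message by keyword frequency. The weights, the threshold, and the function names are invented for illustration; a real perceptron would learn the weights from data.

```python
# Hypothetical keyword weights: spam-like words positive, course-like words negative.
KEYWORD_WEIGHTS = {
    "free": 2.0, "drug": 1.5, "fantastic": 1.0, "deal": 1.2,
    "learning": -1.0, "textbook": -0.8,
}


def spam_score(text):
    """Weighted sum of keyword frequencies in the message."""
    words = text.lower().split()
    return sum(weight * words.count(keyword)
               for keyword, weight in KEYWORD_WEIGHTS.items())


def is_spam(text, threshold=1.0):
    """+1 (spam) if the score exceeds the assumed threshold, else -1."""
    return 1 if spam_score(text) > threshold else -1


print(is_spam("free deal free drug"))        # spam-like keywords
print(is_spam("machine learning textbook"))  # course-like keywords
```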
Learning to Answer Yes/No Perceptron Learning Algorithm (PLA)

Select g from H

H = all possible perceptrons, g = ?

• want: g ≈ f (hard when f unknown)
• almost necessary: g ≈ f on D, ideally $g(\mathbf{x}_n) = f(\mathbf{x}_n) = y_n$
• difficult: H is of infinite size
• idea: start from some $g_0$, and 'correct' its mistakes on D

will represent $g_0$ by its weight vector $\mathbf{w}_0$
Perceptron Learning Algorithm

start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and 'correct' its mistakes on D

For t = 0, 1, ...
(1) find a mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$

    $\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_{n(t)}) \neq y_{n(t)}$

(2) (try to) correct the mistake by

    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$

... until no more mistakes
return last w (called $\mathbf{w}_{PLA}$) as g

[Figure: geometry of the update, shown for y = +1.]

That's it!
A fault confessed is half redressed. :-)
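
The loop above can be sketched in a few lines of Python. This is an illustrative version under stated assumptions, not the lecturer's code: inputs are 'tall' vectors with x_0 = 1 first, a point exactly on the boundary counts as a mistake, the first mistake in data order is picked, and `max_updates` is a safety cap added here. The toy data is made up.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def pla(X, y, max_updates=10_000):
    """Basic PLA. X: list of 'tall' inputs (x_0 = 1 first); y: list of +1/-1 labels."""
    w = [0.0] * len(X[0])  # w_0 = 0
    for t in range(max_updates):
        # Find a mistake of w_t: y_n * w^T x_n <= 0 (boundary counts as a mistake).
        mistake = next((n for n in range(len(X)) if y[n] * dot(w, X[n]) <= 0), None)
        if mistake is None:
            return w, t  # no more mistakes; t is the number of updates made
        # Correct it: w_{t+1} <- w_t + y_n x_n
        w = [wi + y[mistake] * xi for wi, xi in zip(w, X[mistake])]
    raise RuntimeError("no mistake-free w found within max_updates")


# Made-up linearly separable toy data in R^2 (with x_0 = 1 prepended).
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
y = [1, 1, -1, -1]

w_pla, updates = pla(X, y)
print(w_pla, updates)
```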





Practical Implementation of PLA

start from some $\mathbf{w}_0$ (say, $\mathbf{0}$), and 'correct' its mistakes on D

For t = 0, 1, ...
(1) find the next mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$

    $\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_{n(t)}) \neq y_{n(t)}$

(2) correct the mistake by

    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$

... until a full cycle of not encountering mistakes

next can follow naive cycle (1, ..., N)
or precomputed random cycle
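
A possible sketch of the cyclic variant: examples are visited in a fixed order, and the loop stops after N consecutive visits without a mistake. The `order` parameter, the `max_updates` cap, and the toy data are assumptions made for this example.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def pla_cyclic(X, y, order=None, max_updates=10_000):
    """Cyclic PLA. `order` is the visiting cycle; default is the naive cycle 0..N-1."""
    N = len(X)
    order = list(range(N)) if order is None else list(order)
    w = [0.0] * len(X[0])
    updates = 0
    clean_streak = 0  # consecutive visited examples that were classified correctly
    pos = 0
    while clean_streak < N:  # stop after a full mistake-free cycle
        n = order[pos]
        if y[n] * dot(w, X[n]) <= 0:  # mistake (boundary counts as a mistake)
            w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]
            updates += 1
            clean_streak = 0
            if updates >= max_updates:
                raise RuntimeError("update cap reached; data may not be separable")
        else:
            clean_streak += 1
        pos = (pos + 1) % N
    return w, updates


# Made-up linearly separable toy data (x_0 = 1 prepended).
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0],
     [1.0, -2.0, -1.0], [1.0, 0.5, 1.5], [1.0, -1.5, 0.2]]
y = [1, 1, -1, -1, 1, -1]

w_naive, updates_naive = pla_cyclic(X, y)

random_order = list(range(len(X)))
random.Random(0).shuffle(random_order)  # precomputed random cycle
w_random, updates_random = pla_cyclic(X, y, order=random_order)
print(updates_naive, updates_random)
```

Precomputing the random cycle once keeps each pass cheap; the halting test is the same for both visiting orders.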


Hsuan-Tien Lin (NTU CSIE) 










Seeing is Believing

[Figure: PLA run on a 2D data set of ○ (+1) and × (−1) points, shown from the initial weights through successive updates (update: 1, 2, ..., 9) to the final separating line.]

worked like a charm with < 20 lines!!
(note: made $x_i \gg x_0 = 1$ for visual purpose)





Some Remaining Issues of PLA

'correct' mistakes on D until no mistakes

Algorithmic: halt (with no mistake)?
• naive cyclic: ??
• random cyclic: ??
• other variant: ??

Learning: g ≈ f?
• on D, if halt, yes (no mistake)
• outside D: ??
• if not halting: ??

[to be shown] if (...), after 'enough' corrections, any PLA variant halts


Fun Time

Let's try to think about why PLA may work.
Let n = n(t). According to the rule of PLA below, which formula is true?

$\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_n) \neq y_n, \quad \mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_n \mathbf{x}_n$

(1) $\mathbf{w}_{t+1}^T \mathbf{x}_n = y_n$
(2) $\operatorname{sign}(\mathbf{w}_{t+1}^T \mathbf{x}_n) = y_n$
(3) $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n \ge y_n \mathbf{w}_t^T \mathbf{x}_n$
(4) $y_n \mathbf{w}_{t+1}^T \mathbf{x}_n < y_n \mathbf{w}_t^T \mathbf{x}_n$

Reference Answer: (3)

Simply multiply the second part of the rule by $y_n \mathbf{x}_n$. The result shows that the rule somewhat 'tries to correct the mistake.'
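
The hint in the reference answer can be written out in one short chain, using only the update rule and $y_n^2 = 1$:

```latex
\begin{aligned}
y_n \mathbf{w}_{t+1}^T \mathbf{x}_n
  &= y_n (\mathbf{w}_t + y_n \mathbf{x}_n)^T \mathbf{x}_n \\
  &= y_n \mathbf{w}_t^T \mathbf{x}_n + y_n^2 \|\mathbf{x}_n\|^2 \\
  &= y_n \mathbf{w}_t^T \mathbf{x}_n + \|\mathbf{x}_n\|^2
  \;\ge\; y_n \mathbf{w}_t^T \mathbf{x}_n .
\end{aligned}
```

The score moves in the right direction, but it need not cross zero after a single update, which is why the first two options are not guaranteed.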
Learning to Answer Yes/No Guarantee of PLA

Linear Separability

• if PLA halts (i.e. no more mistakes), (necessary condition) D allows some w to make no mistake
• call such D linear separable

[Figure: three example data sets, captioned (linear separable), (not linear separable), (not linear separable).]

assume linear separable D,
does PLA always halt?
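PLA Fact: $\mathbf{w}_t$ Gets More Aligned with $\mathbf{w}_f$

linear separable D ⇔ exists perfect $\mathbf{w}_f$ such that $y_n = \operatorname{sign}(\mathbf{w}_f^T \mathbf{x}_n)$

• $\mathbf{w}_f$ perfect, hence every $\mathbf{x}_n$ correctly away from line:

  $y_{n(t)} \mathbf{w}_f^T \mathbf{x}_{n(t)} \ge \min_n y_n \mathbf{w}_f^T \mathbf{x}_n > 0$

• $\mathbf{w}_f^T \mathbf{w}_t$ ↑ by updating with any $(\mathbf{x}_{n(t)}, y_{n(t)})$

$$
\begin{aligned}
\mathbf{w}_f^T \mathbf{w}_{t+1} &= \mathbf{w}_f^T \left(\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\right) \\
&\ge \mathbf{w}_f^T \mathbf{w}_t + \min_n y_n \mathbf{w}_f^T \mathbf{x}_n \\
&> \mathbf{w}_f^T \mathbf{w}_t + 0.
\end{aligned}
$$

$\mathbf{w}_t$ appears more aligned with $\mathbf{w}_f$ after update
(really?)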





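PLA Fact: $\mathbf{w}_t$ Does Not Grow Too Fast

$\mathbf{w}_t$ changed only when mistake
⇔ $\operatorname{sign}(\mathbf{w}_t^T \mathbf{x}_{n(t)}) \neq y_{n(t)}$ ⇔ $y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} \le 0$

• mistake 'limits' $\|\mathbf{w}_t\|^2$ growth, even when updating with 'longest' $\mathbf{x}_n$

$$
\begin{aligned}
\|\mathbf{w}_{t+1}\|^2 &= \|\mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
&= \|\mathbf{w}_t\|^2 + 2 y_{n(t)} \mathbf{w}_t^T \mathbf{x}_{n(t)} + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
&\le \|\mathbf{w}_t\|^2 + 0 + \|y_{n(t)} \mathbf{x}_{n(t)}\|^2 \\
&\le \|\mathbf{w}_t\|^2 + \max_n \|y_n \mathbf{x}_n\|^2
\end{aligned}
$$

start from $\mathbf{w}_0 = \mathbf{0}$, after T mistake corrections,

$\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|} \ge \sqrt{T} \cdot \text{constant}$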








Fun Time

Let's upper-bound T, the number of mistakes that PLA 'corrects'.

Define $R^2 = \max_n \|\mathbf{x}_n\|^2$ and $\rho = \min_n y_n \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \mathbf{x}_n$.

We want to show that T ≤ □. Express the upper bound □ by the two terms above.

(1) $R/\rho$
(2) $R^2/\rho^2$
(3) $R/\rho^2$
(4) $\rho^2/R^2$

Reference Answer: (2)

The maximum value of $\frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|}$ is 1. Since T mistake corrections increase the inner product by $\sqrt{T} \cdot \text{constant}$, the maximum number of corrected mistakes is $1/\text{constant}^2$.
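
The reference answer compresses the argument. One way to spell it out is to chain the two PLA facts from $\mathbf{w}_0 = \mathbf{0}$ over T corrections, using $\|y_n \mathbf{x}_n\| = \|\mathbf{x}_n\|$:

```latex
\begin{aligned}
\mathbf{w}_f^T \mathbf{w}_T
  &\ge T \min_n y_n \mathbf{w}_f^T \mathbf{x}_n = T \rho \|\mathbf{w}_f\|, \\
\|\mathbf{w}_T\|^2
  &\le T \max_n \|\mathbf{x}_n\|^2 = T R^2, \\
1 \;\ge\; \frac{\mathbf{w}_f^T}{\|\mathbf{w}_f\|} \frac{\mathbf{w}_T}{\|\mathbf{w}_T\|}
  &\ge \frac{T \rho \|\mathbf{w}_f\|}{\|\mathbf{w}_f\| \sqrt{T} R}
   = \sqrt{T} \cdot \frac{\rho}{R}
  \quad\Longrightarrow\quad T \le \frac{R^2}{\rho^2}.
\end{aligned}
```

So the 'constant' is $\rho / R$, and the bound depends on the data only through its largest input norm and its margin with respect to $\mathbf{w}_f$.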
Learning to Answer Yes/No Non-Separable Data

More about PLA

Guarantee
as long as linear separable and correct by mistake
• inner product of $\mathbf{w}_f$ and $\mathbf{w}_t$ grows fast; length of $\mathbf{w}_t$ grows slowly
• PLA 'lines' are more and more aligned with $\mathbf{w}_f$ ⇒ halts

Pros
simple to implement, fast, works in any dimension d

Cons
• 'assumes' linear separable D to halt; property unknown in advance (no need for PLA if we know $\mathbf{w}_f$)
• not fully sure how long halting takes ($\rho$ depends on $\mathbf{w}_f$), though practically fast

what if D not linear separable?
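
On separable data, the halting bound T ≤ R²/ρ² can be compared with the actual number of PLA updates when some perfect w_f happens to be known. The sketch below does this on made-up toy data; the choice of `w_f` is an assumption that the code verifies (ρ > 0), and in practice w_f is unknown, which is exactly why the bound cannot be computed in advance.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def pla_updates(X, y):
    """Run PLA from w = 0 and return (w, number of mistake corrections)."""
    w = [0.0] * len(X[0])
    T = 0
    while True:
        mistake = next((n for n in range(len(X)) if y[n] * dot(w, X[n]) <= 0), None)
        if mistake is None:
            return w, T
        w = [wi + y[mistake] * xi for wi, xi in zip(w, X[mistake])]
        T += 1


# Made-up separable toy data (x_0 = 1 prepended).
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0],
     [1.0, -2.0, -1.0], [1.0, 0.5, 1.5], [1.0, -1.5, 0.2]]
y = [1, 1, -1, -1, 1, -1]

w_f = [0.0, 1.0, 1.0]  # assumed perfect separator for this toy data
norm_f = math.sqrt(dot(w_f, w_f))
rho = min(y[n] * dot(w_f, X[n]) / norm_f for n in range(len(X)))
R2 = max(dot(x, x) for x in X)

w_pla, T = pla_updates(X, y)
bound = R2 / rho ** 2
print(T, bound)
```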
Learning with Noisy Data

[Figure: learning flow with noise. The unknown target function f: X → Y + noise (ideal credit approval formula) generates the training examples D: $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)$ (historical records in bank). Together with the hypothesis set H (set of candidate formulas), these lead to the final hypothesis g ≈ f ('learned' formula to be used).]

how to at least get g ≈ f on noisy D?
Line with Noise Tolerance

• assume 'little' noise: $y_n = f(\mathbf{x}_n)$ usually
• if so, g ≈ f on D ⇔ $y_n = g(\mathbf{x}_n)$ usually
• how about

  $\mathbf{w}_g \leftarrow \operatorname*{argmin}_{\mathbf{w}} \sum_{n=1}^{N} \big[\!\big[\, y_n \neq \operatorname{sign}(\mathbf{w}^T \mathbf{x}_n) \,\big]\!\big]$

  NP-hard to solve, unfortunately
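
Evaluating the objective for one given w is a single pass over D; the hardness lies in minimizing it over all w. A small sketch of the evaluation, on made-up data with one deliberately 'noisy' label (the last point is the midpoint of two positive points but labeled −1, so no line can classify everything correctly):

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def zero_one_errors(w, X, y):
    """Count n with y_n != sign(w^T x_n); boundary points count as errors (assumption)."""
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


# Made-up data (x_0 = 1 prepended); the last example carries a noisy label.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0],
     [1.0, -2.0, -1.0], [1.0, 0.5, 1.5], [1.0, -1.5, 0.2],
     [1.0, 2.5, 1.5]]
y = [1, 1, -1, -1, 1, -1, -1]

w_a = [0.0, 1.0, 1.0]   # hypothetical candidate
w_b = [0.0, 1.0, -1.0]  # another hypothetical candidate

errors_a = zero_one_errors(w_a, X, y)
errors_b = zero_one_errors(w_b, X, y)
print(errors_a, errors_b)
```

Comparing candidates this way is cheap; what is intractable in general is searching the infinite set of lines for the one with the fewest errors.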








can we modify PLA to get 
an ‘approximately good’ g? 
Pocket Algorithm

modify PLA algorithm (black lines) by keeping best weights in pocket

initialize pocket weights $\hat{\mathbf{w}}$
For t = 0, 1, ...
(1) find a (random) mistake of $\mathbf{w}_t$ called $(\mathbf{x}_{n(t)}, y_{n(t)})$
(2) (try to) correct the mistake by

    $\mathbf{w}_{t+1} \leftarrow \mathbf{w}_t + y_{n(t)} \mathbf{x}_{n(t)}$

(3) if $\mathbf{w}_{t+1}$ makes fewer mistakes than $\hat{\mathbf{w}}$, replace $\hat{\mathbf{w}}$ by $\mathbf{w}_{t+1}$

... until enough iterations
return $\hat{\mathbf{w}}$ (called $\mathbf{w}_{POCKET}$) as g

a simple modification of PLA to find
(somewhat) 'best' weights
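
A possible Python sketch of the pocket steps. The iteration budget, the fixed random seed, and the toy data (with one noisy label that makes D non-separable) are assumptions made for this example.

```python
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def count_mistakes(w, X, y):
    # A point on the boundary (y * w^T x == 0) is counted as a mistake (assumption).
    return sum(1 for xn, yn in zip(X, y) if yn * dot(w, xn) <= 0)


def pocket(X, y, iterations=200, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    w_hat, best = list(w), count_mistakes(w, X, y)  # initialize pocket weights
    for _ in range(iterations):
        wrong = [n for n in range(len(X)) if y[n] * dot(w, X[n]) <= 0]
        if not wrong:
            break  # current w is mistake-free, so the pocket already holds it
        n = rng.choice(wrong)                            # (1) a random mistake
        w = [wi + y[n] * xi for wi, xi in zip(w, X[n])]  # (2) PLA correction
        m = count_mistakes(w, X, y)
        if m < best:                                     # (3) keep the better weights
            w_hat, best = list(w), m
    return w_hat, best


# Made-up data (x_0 = 1 prepended); the last label is noisy, so D is not separable.
X = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0],
     [1.0, -2.0, -1.0], [1.0, 0.5, 1.5], [1.0, -1.5, 0.2],
     [1.0, 2.5, 1.5]]
y = [1, 1, -1, -1, 1, -1, -1]

w_pocket, mistakes = pocket(X, y)
print(w_pocket, mistakes)
```

Each iteration recounts the mistakes over all of D, which is the extra cost relative to plain PLA.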


Machine Learning Foundations 
Fun Time

Should we use pocket or PLA?

Since we do not know whether D is linear separable in advance, we may decide to just go with pocket instead of PLA. If D is actually linear separable, what's the difference between the two?

(1) pocket on D is slower than PLA
(2) pocket on D is faster than PLA
(3) pocket on D returns a better g in approximating f than PLA
(4) pocket on D returns a worse g in approximating f than PLA

Reference Answer: (1)

Because pocket needs to check whether $\mathbf{w}_{t+1}$ is better than $\hat{\mathbf{w}}$ in each iteration, it is slower than PLA. On linear separable D, $\mathbf{w}_{POCKET}$ is the same as $\mathbf{w}_{PLA}$, both making no mistakes.
Summary
(1) When Can Machines Learn?

    Lecture 1: The Learning Problem
    Lecture 2: Learning to Answer Yes/No
    • Perceptron Hypothesis Set: hyperplanes/linear classifiers in $\mathbb{R}^d$
    • Perceptron Learning Algorithm (PLA): correct mistakes and improve iteratively
    • Guarantee of PLA: no mistake eventually if linear separable
    • Non-Separable Data: hold somewhat 'best' weights in pocket

    • next: the zoo of learning problems

(2) Why Can Machines Learn?
(3) How Can Machines Learn?
(4) How Can Machines Learn Better?